Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{(N+1)}{2}\)
Following the instructions from above, here is my R syntax to generate a random variable X:
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
From the above syntax, you can see that I’ve created a variable \(X\) with 10,000 random uniform numbers from 1 to \(N\) (\(N=10\)). Additionally, I’ve created a random variable \(Y\) that has 10,000 random normal numbers with a \(\mu\) and \(\sigma\) of \(\frac{N+1}{2}\).
Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
To work through this, I’ll first have to calculate \(x\) (the median) and \(y\) (the first quartile). Additionally, I’ve saved \(X\) and \(Y\) in a dataframe can stored the total number of rows (10,000) in a variable:
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)
df <- data.frame(X = X, Y = Y)
total_rows <- nrow(df)
a) \(P(X>x \ | \ X>y)\)
We can use the following equation to find the probability:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X>x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X > x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB <- P_AandB / P_B
P_AgivenB
## [1] 0.5509642
Solution: The probability \(P(X>x \ | \ X>y) = 0.5510\), which means that the probability that a uniform number from 1 to 10 is greater than the median of 5.45 given this number is greater than 1.84 is 0.5510.
b) \(P(X>x, \ Y>y)\)
We can use the following equation to solve this portion:
\[P(A,B)=P(A) \times P(B) \\ where \ A=N(X>x) \ and \ B=N(Y>y)\]
With this in mind, we can solve:
P_AandB_2 <- nrow(df %>% filter(X > x & Y > y)) / total_rows
P_AandB_2
## [1] 0.3756
Solution: The probability \(P(X>x, \ Y>y) = 0.3756\), which means that the probability that a uniform number from 1 to 10 is greater than the median of 5.45 and this number is greater than 1.84 is 0.3756.
c) \(P(X<x, \ X>y)\)
Similar to part a, we can use the following equation to solve:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X<x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X < x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB_2 <- P_AandB / P_B
P_AgivenB_2
## [1] 0.4490358
Solution: The probability \(P(X<x \ | \ X>y) = 0.4490\), which means that the probability that a uniform number from 1 to 10 is less than the median of 5.45 given this number is greater than 1.84 is 0.4490.
Investigate whether \(P(X>x \ and \ Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
To do this investigation, we’ll first have to calculate the joint probabilities:
We’ll calculate joint probabilities by using the equation:
\[P(A \cap B) = P(A \times B)\] Therefore, we first will create a few columns to distinguish whether our values meet the following conditions:
Then, we can take the count of each condition, and multiply the four conditions to get their respective joint probabilities:
Below is the R syntax I used to do this, by using group_by() and summarise() functions to do the calculations with the tidyverse package:
# Joint probabilities
prob_df <- df %>%
mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>%
group_by(A, B) %>%
summarise(count = n()) %>%
mutate(prob = count / total_rows)
Next, we need to back up a bit and examine the marginal probabilities of \(P(A)\) and \(P(B)\) independently:
Intuitively we know that \(x = median(X)\), therefore, half of the values of \(X\) > \(x\) and the other half of \(X\) < \(x\). Similarly, since we know that \(y = first \ quartile \ of \ Y\), a quarter of \(Y\) < \(y\) and the remaining three fourths of the values of \(Y\) > \(y\). Therefore, the marginal probabilities are:
We can check these assumptions by reworking our table:
prob_df <- prob_df %>%
ungroup() %>%
group_by(A) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(B = "Total Prob") %>%
bind_rows(prob_df)
prob_df <- prob_df %>%
ungroup() %>%
group_by(B) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(A = "Total Prob") %>%
bind_rows(prob_df)
And by printing our table, we can see that our marginal probabilities that we assumed are confirmed:
prob_df %>%
select(-count) %>%
spread(A, prob) %>%
rename("Conditions" = B) %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
| Conditions | X < x | X > x | Total Prob |
|---|---|---|---|
| Y < y | 0.1256 | 0.1244 | 0.25 |
| Y > y | 0.3744 | 0.3756 | 0.75 |
| Total Prob | 0.5000 | 0.5000 | 1.00 |
Solution: After generating the table above, it appears that \(P(X>x \ and \ Y>y)\) does equal \(P(X>x)P(Y>y)\), since they both have a value of 0.3756. By doing the calculations to check, \(P(X>x)P(Y>y) = (0.50)(0.75) = 0.375\) and \(P(X>x \ and \ Y>y) = 0.375\) as well.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
In order to run our Fisher’s Exact Test and Chi Square Test, we’ll first need to find the counts of each our conditions, similar to the previous question and create a two-by-two table:
X_plus <- nrow(df %>%
filter(X > x))
X_lesseq <- nrow(df %>%
filter(X <= x))
Y_lesseq <- nrow(df %>%
filter(Y <= y))
Y_plus <- nrow(df %>%
filter(Y > y))
freq_matrix <- matrix(c(X_plus, X_lesseq, Y_plus, Y_lesseq),
nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(c("x", "y"),
c("X > x; Y > y", "X <= x; Y <= y")))
With our frequency table ready to go, we can see our counts below:
freq_matrix %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
| X > x; Y > y | X <= x; Y <= y | |
|---|---|---|
| x | 5000 | 5000 |
| y | 7500 | 2500 |
Now, we are ready to run our Fisher’s Exact Test:
fisher.test(freq_matrix)
##
## Fisher's Exact Test for Count Data
##
## data: freq_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3137986 0.3540659
## sample estimates:
## odds ratio
## 0.3333333
And here is the Chi-Square Test:
chisq.test(freq_matrix)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_matrix
## X-squared = 1332.3, df = 1, p-value < 2.2e-16
When looking at both of the outputs, we can see that we reject the null hypothesise for both, given that the p-value is less than 0.05 for the 95 percent confidence interval. Since the Fisher’s Exact Test is typically used for smaller sample sizes, and our sample is of 10,000 observations, it would be more appropriate to use the Chi-Square Test.
Solution:
Our \(H_0\) = There is no relationship between X and Y.
Our \(H_a\) = There is a significant relationship between X and Y.
Given our outputs, independence does not hold when we run the Fisher’s Exact Test and Chi-Square test. We reject the null hypothesis that there is no relationship between X and Y. Since the Fisher’s Exact Test is typically used on smaller sample sizes, and we have a large sample size of 10,000 observations, it would be more appropriate to rely on the output from our Chi-Square Test.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics: Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
kaggle_df <- read_csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Final%20Exam/train.csv')
# summary(kaggle_df)
On a quick look at the summary, I decided to pull out the numeric values to explore a bit more:
numerics <- data.frame(summary(kaggle_df))
numerics <- numerics %>%
filter(!is.na(Freq))
numerics <- data.frame(numerics %>%
filter(!grepl('Class', Freq), !grepl('Length', Freq), !grepl('Mode', Freq)) %>%
separate(Freq, c('Value', 'Count'), sep = ":", remove = TRUE))
numerics$Var2 <- trimws(numerics$Var2)
numerics$Value <- trimws(numerics$Value)
numerics$Count <- round(as.numeric(trimws(numerics$Count)), 1)
numerics <- numerics %>%
select(Var2, Value, Count) %>%
spread(Value, Count) %>%
rename("Attribute" = "Var2", "Q1" = `1st Qu.`, "Q3" = `3rd Qu.`, "Max" = "Max.", "Min" = "Min.", "NA" = "NA's") %>%
select(Attribute, Min, Q1, Median, Mean, Q3, Max, `NA`)
kable(numerics) %>%
kable_styling(bootstrap_options = 'striped')
| Attribute | Min | Q1 | Median | Mean | Q3 | Max | NA |
|---|---|---|---|---|---|---|---|
| 1stFlrSF | 334 | 882.0 | 1087.0 | 1163.0 | 1391.0 | 4692 | NA |
| 2ndFlrSF | 0 | 0.0 | 0.0 | 347.0 | 728.0 | 2065 | NA |
| 3SsnPorch | 0 | 0.0 | 0.0 | 3.4 | 0.0 | 508 | NA |
| BedroomAbvGr | 0 | 2.0 | 3.0 | 2.9 | 3.0 | 8 | NA |
| BsmtFinSF1 | 0 | 0.0 | 383.5 | 443.6 | 712.2 | 5644 | NA |
| BsmtFinSF2 | 0 | 0.0 | 0.0 | 46.5 | 0.0 | 1474 | NA |
| BsmtFullBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 3 | NA |
| BsmtHalfBath | 0 | 0.0 | 0.0 | 0.1 | 0.0 | 2 | NA |
| BsmtUnfSF | 0 | 223.0 | 477.5 | 567.2 | 808.0 | 2336 | NA |
| EnclosedPorch | 0 | 0.0 | 0.0 | 21.9 | 0.0 | 552 | NA |
| Fireplaces | 0 | 0.0 | 1.0 | 0.6 | 1.0 | 3 | NA |
| FullBath | 0 | 1.0 | 2.0 | 1.6 | 2.0 | 3 | NA |
| GarageArea | 0 | 334.5 | 480.0 | 473.0 | 576.0 | 1418 | NA |
| GarageCars | 0 | 1.0 | 2.0 | 1.8 | 2.0 | 4 | NA |
| GarageYrBlt | 1900 | 1961.0 | 1980.0 | 1979.0 | 2002.0 | 2010 | 81 |
| GrLivArea | 334 | 1130.0 | 1464.0 | 1515.0 | 1777.0 | 5642 | NA |
| HalfBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 2 | NA |
| Id | 1 | 365.8 | 730.5 | 730.5 | 1095.2 | 1460 | NA |
| KitchenAbvGr | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 3 | NA |
| LotArea | 1300 | 7554.0 | 9478.0 | 10517.0 | 11602.0 | 215245 | NA |
| LotFrontage | 21 | 59.0 | 69.0 | 70.0 | 80.0 | 313 | 259 |
| LowQualFinSF | 0 | 0.0 | 0.0 | 5.8 | 0.0 | 572 | NA |
| MasVnrArea | 0 | 0.0 | 0.0 | 103.7 | 166.0 | 1600 | 8 |
| MiscVal | 0 | 0.0 | 0.0 | 43.5 | 0.0 | 15500 | NA |
| MoSold | 1 | 5.0 | 6.0 | 6.3 | 8.0 | 12 | NA |
| MSSubClass | 20 | 20.0 | 50.0 | 56.9 | 70.0 | 190 | NA |
| OpenPorchSF | 0 | 0.0 | 25.0 | 46.7 | 68.0 | 547 | NA |
| OverallCond | 1 | 5.0 | 5.0 | 5.6 | 6.0 | 9 | NA |
| OverallQual | 1 | 5.0 | 6.0 | 6.1 | 7.0 | 10 | NA |
| PoolArea | 0 | 0.0 | 0.0 | 2.8 | 0.0 | 738 | NA |
| SalePrice | 34900 | 129975.0 | 163000.0 | 180921.0 | 214000.0 | 755000 | NA |
| ScreenPorch | 0 | 0.0 | 0.0 | 15.1 | 0.0 | 480 | NA |
| TotalBsmtSF | 0 | 795.8 | 991.5 | 1057.4 | 1298.2 | 6110 | NA |
| TotRmsAbvGrd | 2 | 5.0 | 6.0 | 6.5 | 7.0 | 14 | NA |
| WoodDeckSF | 0 | 0.0 | 0.0 | 94.2 | 168.0 | 857 | NA |
| YearBuilt | 1872 | 1954.0 | 1973.0 | 1971.0 | 2000.0 | 2010 | NA |
| YearRemodAdd | 1950 | 1967.0 | 1994.0 | 1985.0 | 2004.0 | 2010 | NA |
| YrSold | 2006 | 2007.0 | 2008.0 | 2008.0 | 2009.0 | 2010 | NA |
With my numerics dataframe ready, I can now set up some visuals by splitting the dataframe into two, one with all quantative variables, and the other with categorical data:
kaggle_numerics <- kaggle_df[, (colnames(kaggle_df)) %in% numerics$Attribute]
kaggle_categorical <- kaggle_df[, !(colnames(kaggle_df) %in% numerics$Attribute)]
kaggle_numerics %>%
dplyr::select(2:11) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(12:21) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(22:31) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(32:38) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
After creating a few plot matrices with our quantitative data, I thought there were a few initial takeaways:
Many of the living spaces, and the square footage values seem to show stronger correlations with one another. This makes sense given that many times, the square footage of one floor will likely be similar square footage to another floor in the house. For instance, there’s a strong correlation between the first floor square footage and the basement square footage. When thinking about our regression later, it’s nice to know that there are multiple attributes here that could serve as proxies (if needed).
Additionally, I’m seeing a gradual, positive trend towards larger areas of garages the later they are built in the 20th century – for instance, garages built in the early part of the 20th century seem to be smaller than those that are built in the later 1900s and early 2000s. This may be something to look into further, and it’s relationship with the Sale Price later on.
It’s interesting to see too that the year the house is built has a relationship with the rating of overall material and finish quality of the house. Again, this make sense, but it would be valueable to see if this has a relationship with Sale Price.
Now, let’s explore the categorical variables a bit.
table(kaggle_categorical$LandSlope)
##
## Gtl Mod Sev
## 1382 65 13
# table(kaggle_categorical$BldgType)
table(kaggle_categorical$HouseStyle)
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
# table(kaggle_categorical$RoofStyle)
table(kaggle_categorical$RoofMatl)
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
# table(kaggle_categorical$ExterCond)
table(kaggle_categorical$CentralAir)
##
## N Y
## 95 1365
# table(kaggle_categorical$Utilities)
table(kaggle_categorical$HeatingQC)
##
## Ex Fa Gd Po TA
## 741 49 241 1 428
table(kaggle_categorical$SaleType)
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
table(kaggle_categorical$SaleCondition)
##
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
table(kaggle_categorical$Foundation)
##
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
After looking through a large number of categorical values, and checking frequencies, a few things stood out to me:
It looks like there’s a majority of houses that are built on gentle slopes, however, there’s a proportion that are built on moderate or severe slopes – this could be something to look into further when building out the regression later on. It may be interesting to see if there is any correlation here with the Sale Price.
Similarly, the foundation material seems to be split heavily between cinder blocks and poured concrete. This is another feature that may be useful to explore later to see if this has an affect on housing prices, or if there is a connection between the year a house was built and which type of foundation was used.
Finally, Sale Type and Sale Condition were interesting values and their frequencies suggest that there isn’t a “one-size-fits-all” approach to certain transactions, making this an enticing set of features to explore further.
After plotting all of my quantitative variables earlier, I thought I’d hone in on a few that may be useful to explore their relationship with the Sale Price variable. You can find my plot below:
pairs(kaggle_df[, c('1stFlrSF', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', 'YearBuilt', 'SalePrice' )], pch = 19)
For our correlation matrix, I will use GrLivArea, 1stFlrSF, and SalePrice. And taking a few of these same attributes, I thought it would be interesting to plot these in a correlation matrix:
kaggle_numerics %>%
dplyr::select(`1stFlrSF`, `GrLivArea`, SalePrice) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
corr_matrix <- kaggle_numerics %>%
select(`1stFlrSF`, GrLivArea, SalePrice) %>%
cor() %>%
as.matrix()
As we can see from above, all three attributes, the square footage of the first floor, the above grade (ground) living area and the Sale Price all seem to have a positive relationship with one another. All three distributions also appear to be unimodal and right-skewed. Below is the official correlation matrix:
kable(corr_matrix) %>%
kable_styling(bootstrap_options = 'striped', full_width = FALSE)
| 1stFlrSF | GrLivArea | SalePrice | |
|---|---|---|---|
| 1stFlrSF | 1.0000000 | 0.5660240 | 0.6058522 |
| GrLivArea | 0.5660240 | 1.0000000 | 0.7086245 |
| SalePrice | 0.6058522 | 0.7086245 | 1.0000000 |
Now, although these correlations seem to be quite strong, we can do some hypothesis testing on their relationships at an 80-percent confidence interval:
corr1 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$`1stFlrSF`, method = 'pearson', conf.level = 0.80)
corr2 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr3 <- cor.test(kaggle_numerics$`1stFlrSF`, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr1
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$`1stFlrSF`
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
corr2
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
corr3
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$`1stFlrSF` and kaggle_numerics$GrLivArea
## t = 26.217, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5427732 0.5884078
## sample estimates:
## cor
## 0.566024
Solution: After running the three pairwise correlations between my variables of SalePrice, 1stFlrSF, and GrLivArea, I can see from the outputs that all three accept the alternative hypothesis that the true correlation is not equal to zero. Therefore, this rejects the null hypothesis the correlations between each pairwise set of variables is zero. I know this to be true since the p-values for all three of my Person’s correlation tests are less than 0.05.
Solution: Additionally, since we are dealing with a large sample size, and our correlation tests are yielding very small p-values, all less than 0.001, I’m not worried about family-wise error when measuring relationships across these three attributes. With p-values this low, it creates a solid argument that we aren’t running into a Type 1 error here.
Linear Algebra and Correlation: Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.